15 research outputs found
Explaining Explanation: An Empirical Study on Explanation in Code Reviews
Code review is an important process for quality assurance in software
development. For an effective code review, the reviewers must explain their
feedback to enable the authors of the code change to act on it. However, the
explanation needs may differ among developers, who may require different types
of explanations. It is therefore crucial to understand what kind of
explanations reviewers usually use in code reviews. To the best of our
knowledge, no study published to date has analyzed the types of explanations
used in code review. In this study, we present the first analysis of
explanations in useful code reviews. We extracted a set of code reviews based
on their usefulness and labeled them based on whether they contained an
explanation, a solution, or both a proposed solution and an explanation
thereof.
Based on our analysis, we found that a significant portion of the code review
comments (46%) only include solutions without providing an explanation. We
further investigated the remaining 54% of code review comments containing an
explanation and conducted an open card sorting to categorize the reviewers'
explanations. We distilled seven distinct categories of explanations based on
the expression forms developers used. Then, we utilized a large language model,
specifically ChatGPT, to assist developers in obtaining a code review
explanation that suits their preferences. Specifically, we created prompts that
transform a code review explanation into a specified type of explanation. Our
evaluation results show that ChatGPT generated the specified type of
explanation in 88/90 cases and produced a correct explanation in 89/90 cases.
Overall, our study provides insights into the types of explanations that
developers use in code review and showcases how ChatGPT can be leveraged during
the code review process to generate a specific type of explanation.
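A minimal sketch of this prompting setup (the client usage is the standard openai Python package; the prompt wording and the target type below are illustrative assumptions, not the study's actual prompts or categories):

    # Hypothetical: ask an LLM to rephrase a code review explanation into a
    # requested explanation type. Prompt text and type names are assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def transform_explanation(comment: str, target_type: str) -> str:
        prompt = (
            f"Rewrite the following code review explanation as a "
            f"'{target_type}' explanation, keeping its technical content:\n\n"
            f"{comment}"
        )
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    print(transform_explanation(
        "Use a StringBuilder; string concatenation in a loop is quadratic.",
        "rule statement",  # hypothetical category name
    ))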
APISENS: Sentiment Scoring Tool for APIs with Crowd-Knowledge
Utilizing pre-existing software artifacts, such as libraries and Application
Programming Interfaces (APIs), is crucial for software development efficiency.
However, the abundance of artifacts that provide similar functionality can lead
to confusion among developers, resulting in a challenge for proper selection
and implementation. Through our preliminary investigation, we found that
utilizing the collective knowledge of a crowd can greatly assist developers in
acquiring a thorough understanding of the complexities involved in the software
development process. Moreover, as emotions are an inseparable part of human
nature, they influence developers' activities. In this regard, we
attempt to build a tool that can retrieve sentiment information for software
APIs so that developers can determine which APIs to utilize for their tasks. We
employ data from two popular platforms (i.e., Twitter and YouTube) to build our
research prototype. The source code, tool, and demo video are available on
GitHub at https://github.com/FalconLK/APISens.
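A minimal sketch of the crowd-sentiment aggregation idea (the data layout and scoring scheme are our assumptions for illustration; the actual pipeline lives in the repository above):

    # Illustrative only: aggregate pre-scored crowd posts (e.g., tweets,
    # YouTube comments) into a per-API sentiment summary. Scores in [-1, 1]
    # are assumed to come from an upstream sentiment classifier.
    from statistics import mean

    posts = [  # (api_name, sentiment_score) -- fabricated examples
        ("requests", 0.8), ("requests", 0.6), ("requests", -0.2),
        ("urllib3", 0.1), ("urllib3", -0.5),
    ]

    def score_apis(posts):
        by_api = {}
        for api, score in posts:
            by_api.setdefault(api, []).append(score)
        return {api: {"mean": round(mean(s), 2), "n": len(s)}
                for api, s in by_api.items()}

    print(score_apis(posts))
    # {'requests': {'mean': 0.4, 'n': 3}, 'urllib3': {'mean': -0.2, 'n': 2}}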
APIHarvest: Harvesting API Information from Various Online Sources
Using APIs to develop software applications is the norm. APIs help developers
to build applications faster as they do not need to reinvent the wheel. It is
therefore important for developers to understand the APIs that they plan to
use. Developers should also make themselves aware of relevant information
updates about APIs. In order to do so, developers need to find and keep track
of relevant information about the APIs that they are concerned with. Yet, the
API information is scattered across various online sources, which makes it
difficult to track by hand. Moreover, identifying content that is related to an
API is not trivial. Motivated by these challenges, in this work, we introduce a
tool named APIHarvest that aims to ease the process of finding API information
from various online sources. APIHarvest builds on prior works that link APIs or
libraries to various online sources. It supports finding API information in
GitHub repositories, Stack Overflow posts, tweets, YouTube videos, and Common
Vulnerabilities and Exposures (CVE) entries, and is extensible to support other
sources.
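One way to picture such an extensible design, sketched with a hypothetical plugin interface (class and method names are ours, not the tool's real architecture):

    # Hypothetical plugin-style design for multi-source API info lookup.
    from abc import ABC, abstractmethod

    class InfoSource(ABC):
        @abstractmethod
        def find(self, api_name: str) -> list[str]:
            """Return information items mentioning the given API."""

    class StackOverflowSource(InfoSource):
        def find(self, api_name):
            # A real source would query the Stack Exchange API; canned data here.
            return [f"SO post: how to paginate with {api_name}"]

    class CveSource(InfoSource):
        def find(self, api_name):
            return [f"CVE entry affecting {api_name} < 2.0"]

    SOURCES: list[InfoSource] = [StackOverflowSource(), CveSource()]

    def harvest(api_name: str) -> list[str]:
        # Adding a new source only requires another InfoSource subclass.
        return [item for src in SOURCES for item in src.find(api_name)]

    print(harvest("requests"))

Supporting a further source (e.g., YouTube) would then amount to registering one more subclass, which is how we read the extensibility claim above.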
Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues
In this paper, we systematically study the quality of 4,066 programs generated
by ChatGPT in two popular programming languages, i.e., Java and Python, for
2,033 programming tasks. The goal of this work is threefold. First, we analyze
the correctness of ChatGPT on code generation tasks and uncover the factors
that influence its effectiveness, including task difficulty, programming
language, the time at which tasks were introduced, and program size. Second,
we identify and characterize potential issues with the quality of
ChatGPT-generated code. Last, we provide insights into how these issues can be
mitigated. Experiments highlight that out of 4,066 programs generated by
ChatGPT, 2,757 programs are deemed correct, 1,081 programs provide wrong
outputs, and 177 programs contain compilation or runtime errors. Additionally,
we further analyze other characteristics of the generated code, such as code
style and maintainability, through static analysis tools, and find that 1,933
ChatGPT-generated code snippets suffer from maintainability issues.
Subsequently, we investigate ChatGPT's self-debugging ability and its
interaction with static analysis tools to fix the errors uncovered in the
previous step. Experiments suggest that ChatGPT can partially address these
challenges, improving code quality by more than 20%, but there are still
limitations and opportunities for improvement. Overall, our study provides
valuable insights into the current limitations of ChatGPT and offers a roadmap
for future research and development efforts to enhance the code generation
capabilities of AI models like ChatGPT.
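The self-debugging setup can be sketched as a feedback loop between a static analyzer and the model (pyflakes must be installed; ask_llm_for_fix is a placeholder standing in for a ChatGPT call, not the paper's exact protocol):

    # Illustrative repair loop: run a static analyzer on generated code and,
    # if it reports problems, hand the report back to the model for a fix.
    import subprocess, tempfile

    def analyze(code: str) -> str:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(
            ["python", "-m", "pyflakes", path],  # requires pyflakes installed
            capture_output=True, text=True,
        )
        return result.stdout  # empty string means no findings

    def ask_llm_for_fix(code: str, report: str) -> str:
        # Placeholder: would send the code and the analyzer report to
        # ChatGPT and return the repaired code.
        raise NotImplementedError

    def refine(code: str, max_rounds: int = 3) -> str:
        for _ in range(max_rounds):
            report = analyze(code)
            if not report:
                break  # analyzer is satisfied
            code = ask_llm_for_fix(code, report)
        return code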
CHRONOS: Time-Aware Zero-Shot Identification of Libraries from Vulnerability Reports
Tools that alert developers about library vulnerabilities depend on accurate,
up-to-date vulnerability databases which are maintained by security
researchers. These databases record the libraries related to each
vulnerability. However, the vulnerability reports may not explicitly list every
library and human analysis is required to determine all the relevant libraries.
Human analysis may be slow and expensive, which motivates the need for
automated approaches. Researchers and practitioners have proposed to
automatically identify libraries from vulnerability reports using extreme
multi-label learning (XML).
While state-of-the-art XML techniques showed promising performance, their
experimental settings do not match what happens in practice. Previous
studies randomly split the vulnerability reports data for training and testing
their models without considering the chronological order of the reports. This
may unduly train the models on chronologically newer reports while testing the
models on chronologically older ones. However, in practice, one often receives
chronologically new reports, which may be related to previously unseen
libraries. Under this practical setting, we observe that the performance of
current XML techniques declines substantially; e.g., F1 drops from 0.7 to 0.24
when the chronological order of vulnerability reports is taken into account.
We propose a practical library identification approach, namely CHRONOS, based
on zero-shot learning. The novelty of CHRONOS is three-fold. First, CHRONOS
fits into the practical pipeline by considering the chronological order of
vulnerability reports. Second, CHRONOS enriches the data of the vulnerability
descriptions and labels using a carefully designed data enhancement step.
Third, CHRONOS exploits the temporal ordering of the vulnerability reports
using a cache to prioritize prediction of...
Comment: Accepted to the Technical Track of ICSE 2023
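The chronological split that motivates CHRONOS can be sketched in a few lines (field names are illustrative; the real pipeline adds data enhancement, zero-shot matching, and the recency cache):

    # Time-aware split: train only on reports published before the test
    # reports, instead of splitting randomly.
    def chronological_split(reports, train_fraction=0.8):
        ordered = sorted(reports, key=lambda r: r["published"])
        cut = int(len(ordered) * train_fraction)
        return ordered[:cut], ordered[cut:]  # train on older, test on newer

    reports = [  # fabricated examples for illustration
        {"id": "CVE-2020-0001", "published": "2020-03-01"},
        {"id": "CVE-2021-0002", "published": "2021-06-15"},
        {"id": "CVE-2019-0003", "published": "2019-11-20"},
        {"id": "CVE-2022-0004", "published": "2022-01-05"},
        {"id": "CVE-2021-0005", "published": "2021-09-30"},
    ]
    train, test = chronological_split(reports)
    print([r["id"] for r in train])  # the four oldest reports
    print([r["id"] for r in test])   # ['CVE-2022-0004'], the newest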
NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python
Machine learning (ML) has gained much attention and been incorporated into
our daily lives. While there are numerous publicly available ML projects on
open source platforms such as GitHub, there have been limited attempts to
filter those projects and curate ML projects of high quality. The limited
availability of such high-quality datasets poses an obstacle to understanding
ML projects. To help clear this obstacle, we present NICHE, a manually labelled
dataset consisting of 572 ML projects. Based on evidence of good software
engineering practices, we label 441 of these projects as engineered and 131 as
non-engineered. This dataset can help researchers understand the practices that
are followed in high-quality ML projects. It can also be used as a benchmark
for classifiers designed to identify engineered ML projects.
Comment: Accepted by MSR 2023
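One consequence of the 441/131 split is that a majority-class baseline already reaches about 77% accuracy, so classifiers built on NICHE should report more than raw accuracy. A minimal check (the CSV file name and column names are assumptions, not the dataset's documented schema):

    # Hypothetical loading of NICHE-style labels.
    import csv

    with open("niche_labels.csv") as f:          # assumed file name
        rows = list(csv.DictReader(f))           # assumed columns: repo, label

    n_eng = sum(r["label"] == "engineered" for r in rows)
    print(f"majority-class accuracy: {n_eng / len(rows):.2%}")
    # With 441 engineered out of 572 projects, this is about 77.1%.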
Multi-Granularity Detector for Vulnerability Fixes
With the increasing reliance on Open Source Software, users are exposed to
third-party library vulnerabilities. Software Composition Analysis (SCA) tools
have been created to alert users of such vulnerabilities. SCA requires the
identification of vulnerability-fixing commits. Prior works have proposed
methods that can automatically identify such vulnerability-fixing commits.
However, identifying such commits is highly challenging, as only a very small
minority of commits are vulnerability fixing. Moreover, code changes can be
noisy and difficult to analyze. We observe that noise can occur at different
levels of detail, making it challenging to detect vulnerability fixes
accurately.
To address these challenges and boost the effectiveness of prior works, we
propose MiDas (Multi-Granularity Detector for Vulnerability Fixes). Unlike
prior works, MiDas constructs a separate neural network for each level of code
change granularity, corresponding to commit-level, file-level, hunk-level, and
line-level, following their natural organization. It then utilizes an ensemble
model that combines all base models to generate the final prediction. This
design allows MiDas to better handle the noisy and highly imbalanced nature of
vulnerability-fixing commit data. Additionally, to reduce the human effort
required to inspect code changes, we have designed an effort-aware adjustment
for MiDas's outputs based on commit length. The evaluation results demonstrate
that MiDas outperforms the current state-of-the-art baseline in terms of AUC by
4.9% and 13.7% on Java and Python-based datasets, respectively. Furthermore, in
terms of two effort-aware metrics, EffortCost@L and Popt@L, MiDas also
outperforms the state-of-the-art baseline, achieving improvements of up to
28.2% and 15.9% on Java, and 60% and 51.4% on Python, respectively.
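A stylized view of the multi-granularity ensemble and the effort-aware adjustment (the uniform weights and the length penalty below are illustrative assumptions, not MiDas's learned parameters):

    # Illustrative ensemble over granularity-level scores. In MiDas each
    # granularity has its own neural model; here the scores are stubbed.
    GRANULARITIES = ("commit", "file", "hunk", "line")

    def ensemble_score(scores, weights=None):
        weights = weights or {g: 1 / len(GRANULARITIES) for g in GRANULARITIES}
        return sum(weights[g] * scores[g] for g in GRANULARITIES)

    def effort_adjusted(score, commit_length, alpha=0.001):
        # Downweight very long commits so that inspecting the top-ranked
        # commits costs reviewers less effort (assumed penalty form).
        return score / (1 + alpha * commit_length)

    scores = {"commit": 0.9, "file": 0.7, "hunk": 0.8, "line": 0.6}
    s = ensemble_score(scores)                     # 0.75 with uniform weights
    print(effort_adjusted(s, commit_length=500))   # 0.5 after adjustment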
BugsInPy: A database of existing bugs in Python programs to enable controlled testing and debugging studies
Lee Kuan Yew Fellowship, Singapore Management University